The main goal of our project is to predict whether a driver will file a claim in the next year, which helps the insurance company tailor its prices. Our data come from Porto Seguro, one of Brazil’s largest auto and homeowner insurance companies. Because the data involve personal information, the raw data we received are masked, meaning we have no prior knowledge of what the variables represent; the only available information is the grouping (features that belong to similar groupings are tagged as such in the feature names) and the type of each variable (binary, categorical, etc.). We first explored the data, deleted the variables with a large proportion of missing values, and resampled the data to obtain a more balanced data set. We fit five models in all: Logistic Regression, PCA+C5.0, Random Forest, Neural Networks, and XGBoost. In addition, we used confusion matrices and ROC curves to evaluate and visualize the performance of the five models. It turned out that XGBoost is the best algorithm for this case, with nearly accurate predictions, while the other methods (Logistic Regression, PCA+C5.0, Random Forest, and Neural Networks) did not achieve high accuracy.
Our project is a Kaggle competition project. The data are from Porto Seguro, one of Brazil’s largest auto and homeowner insurance companies, and all the variables describe past information about the drivers. Inaccuracies in a car insurance company’s claim predictions raise the cost of insurance for good drivers and reduce the price for bad ones. In this project, we want to build a model that predicts the probability that a driver will initiate an auto insurance claim in the next year. A more accurate prediction will allow the company to further tailor its prices, and hopefully make auto insurance coverage accessible to more drivers.
Prediction/Hypothesis: Whether a driver will file a claim is associated with the other variables, and it could be predicted using machine learning algorithms based on the past information.
Load all packages we will use in this project.
library(data.table)
library(tibble)
library(purrr)
library(ggplot2)
library(ROSE)
library(dplyr)
library(magrittr)
library(gridExtra)
library(corrplot)
library(caret)
library(rvest)
library(factoextra)
library(randomForest)
library(pROC)
library(ROCR)
library(C50)
library(drat)
library(xgboost)
library(neuralnet)
Data Overview:
Feature names include the postfix bin to indicate binary features and cat to indicate categorical features.
The target column signifies whether or not a claim was filed for that policyholder.
Reading the file with fread() and converting to a tibble is fast, which is useful for data of this size. We have 595212 observations and 59 variables in total.
# In this data, "-1" indicates NA.
PS <- as.tibble(fread('train.csv', na.strings=c("-1","-1.0")))##
Read 0.0% of 595212 rows
Read 50.4% of 595212 rows
Read 97.4% of 595212 rows
Read 595212 rows and 59 (of 59) columns from 0.108 GB file in 00:00:05
str(PS)## Classes 'tbl_df', 'tbl' and 'data.frame': 595212 obs. of 59 variables:
## $ id : int 7 9 13 16 17 19 20 22 26 28 ...
## $ target : int 0 0 0 0 0 0 0 0 0 1 ...
## $ ps_ind_01 : int 2 1 5 0 0 5 2 5 5 1 ...
## $ ps_ind_02_cat : int 2 1 4 1 2 1 1 1 1 1 ...
## $ ps_ind_03 : int 5 7 9 2 0 4 3 4 3 2 ...
## $ ps_ind_04_cat : int 1 0 1 0 1 0 1 0 1 0 ...
## $ ps_ind_05_cat : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ps_ind_06_bin : int 0 0 0 1 1 0 0 1 0 0 ...
## $ ps_ind_07_bin : int 1 0 0 0 0 0 1 0 0 1 ...
## $ ps_ind_08_bin : int 0 1 1 0 0 0 0 0 1 0 ...
## $ ps_ind_09_bin : int 0 0 0 0 0 1 0 0 0 0 ...
## $ ps_ind_10_bin : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ps_ind_11_bin : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ps_ind_12_bin : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ps_ind_13_bin : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ps_ind_14 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ps_ind_15 : int 11 3 12 8 9 6 8 13 6 4 ...
## $ ps_ind_16_bin : int 0 0 1 1 1 1 1 1 1 0 ...
## $ ps_ind_17_bin : int 1 0 0 0 0 0 0 0 0 0 ...
## $ ps_ind_18_bin : int 0 1 0 0 0 0 0 0 0 1 ...
## $ ps_reg_01 : num 0.7 0.8 0 0.9 0.7 0.9 0.6 0.7 0.9 0.9 ...
## $ ps_reg_02 : num 0.2 0.4 0 0.2 0.6 1.8 0.1 0.4 0.7 1.4 ...
## $ ps_reg_03 : num 0.718 0.766 NA 0.581 0.841 ...
## $ ps_car_01_cat : int 10 11 7 7 11 10 6 11 10 11 ...
## $ ps_car_02_cat : int 1 1 1 1 1 0 1 1 1 0 ...
## $ ps_car_03_cat : int NA NA NA 0 NA NA NA 0 NA 0 ...
## $ ps_car_04_cat : int 0 0 0 0 0 0 0 0 0 1 ...
## $ ps_car_05_cat : int 1 NA NA 1 NA 0 1 0 1 0 ...
## $ ps_car_06_cat : int 4 11 14 11 14 14 11 11 14 14 ...
## $ ps_car_07_cat : int 1 1 1 1 1 1 1 1 1 1 ...
## $ ps_car_08_cat : int 0 1 1 1 1 1 1 1 1 1 ...
## $ ps_car_09_cat : int 0 2 2 3 2 0 0 2 0 2 ...
## $ ps_car_10_cat : int 1 1 1 1 1 1 1 1 1 1 ...
## $ ps_car_11_cat : int 12 19 60 104 82 104 99 30 68 104 ...
## $ ps_car_11 : int 2 3 1 1 3 2 2 3 3 2 ...
## $ ps_car_12 : num 0.4 0.316 0.316 0.374 0.316 ...
## $ ps_car_13 : num 0.884 0.619 0.642 0.543 0.566 ...
## $ ps_car_14 : num 0.371 0.389 0.347 0.295 0.365 ...
## $ ps_car_15 : num 3.61 2.45 3.32 2 2 ...
## $ ps_calc_01 : num 0.6 0.3 0.5 0.6 0.4 0.7 0.2 0.1 0.9 0.7 ...
## $ ps_calc_02 : num 0.5 0.1 0.7 0.9 0.6 0.8 0.6 0.5 0.8 0.8 ...
## $ ps_calc_03 : num 0.2 0.3 0.1 0.1 0 0.4 0.5 0.1 0.6 0.8 ...
## $ ps_calc_04 : int 3 2 2 2 2 3 2 1 3 2 ...
## $ ps_calc_05 : int 1 1 2 4 2 1 2 2 1 2 ...
## $ ps_calc_06 : int 10 9 9 7 6 8 8 7 7 8 ...
## $ ps_calc_07 : int 1 5 1 1 3 2 1 1 3 2 ...
## $ ps_calc_08 : int 10 8 8 8 10 11 8 6 9 9 ...
## $ ps_calc_09 : int 1 1 2 4 2 3 3 1 4 1 ...
## $ ps_calc_10 : int 5 7 7 2 12 8 10 13 11 11 ...
## $ ps_calc_11 : int 9 3 4 2 3 4 3 7 4 3 ...
## $ ps_calc_12 : int 1 1 2 2 1 2 0 1 2 5 ...
## $ ps_calc_13 : int 5 1 7 4 1 0 0 3 1 0 ...
## $ ps_calc_14 : int 8 9 7 9 3 9 10 6 5 6 ...
## $ ps_calc_15_bin: int 0 0 0 0 0 0 0 1 0 0 ...
## $ ps_calc_16_bin: int 1 1 1 0 0 1 1 0 1 1 ...
## $ ps_calc_17_bin: int 1 1 1 0 0 0 0 1 0 0 ...
## $ ps_calc_18_bin: int 0 0 0 0 1 1 0 0 0 0 ...
## $ ps_calc_19_bin: int 0 1 1 0 1 1 1 1 0 1 ...
## $ ps_calc_20_bin: int 1 0 0 0 0 1 0 0 1 0 ...
## - attr(*, ".internal.selfref")=<externalptr>
dfmi<-data.frame(feature = names(PS), per_miss = map_dbl(PS, function(x) { sum(is.na(x))/length(x) }))
ggplot(data=dfmi,aes(x = feature, y = per_miss)) +
geom_bar(stat = 'identity', color = 'white', fill = '#5a64cd') +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(x = '', y = '% missing', title = 'Missing Values by Feature') +
theme(plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(labels = scales::percent)
colnames(PS)## [1] "id" "target" "ps_ind_01" "ps_ind_02_cat"
## [5] "ps_ind_03" "ps_ind_04_cat" "ps_ind_05_cat" "ps_ind_06_bin"
## [9] "ps_ind_07_bin" "ps_ind_08_bin" "ps_ind_09_bin" "ps_ind_10_bin"
## [13] "ps_ind_11_bin" "ps_ind_12_bin" "ps_ind_13_bin" "ps_ind_14"
## [17] "ps_ind_15" "ps_ind_16_bin" "ps_ind_17_bin" "ps_ind_18_bin"
## [21] "ps_reg_01" "ps_reg_02" "ps_reg_03" "ps_car_01_cat"
## [25] "ps_car_02_cat" "ps_car_03_cat" "ps_car_04_cat" "ps_car_05_cat"
## [29] "ps_car_06_cat" "ps_car_07_cat" "ps_car_08_cat" "ps_car_09_cat"
## [33] "ps_car_10_cat" "ps_car_11_cat" "ps_car_11" "ps_car_12"
## [37] "ps_car_13" "ps_car_14" "ps_car_15" "ps_calc_01"
## [41] "ps_calc_02" "ps_calc_03" "ps_calc_04" "ps_calc_05"
## [45] "ps_calc_06" "ps_calc_07" "ps_calc_08" "ps_calc_09"
## [49] "ps_calc_10" "ps_calc_11" "ps_calc_12" "ps_calc_13"
## [53] "ps_calc_14" "ps_calc_15_bin" "ps_calc_16_bin" "ps_calc_17_bin"
## [57] "ps_calc_18_bin" "ps_calc_19_bin" "ps_calc_20_bin"
ps<-as.data.frame(PS[,-c(1,23,26,28,34)])
Impute the categorical features by mode and the numeric features by median.
sapply(ps, function(x) sum(is.na(x)))## target ps_ind_01 ps_ind_02_cat ps_ind_03 ps_ind_04_cat
## 0 0 216 0 83
## ps_ind_05_cat ps_ind_06_bin ps_ind_07_bin ps_ind_08_bin ps_ind_09_bin
## 5809 0 0 0 0
## ps_ind_10_bin ps_ind_11_bin ps_ind_12_bin ps_ind_13_bin ps_ind_14
## 0 0 0 0 0
## ps_ind_15 ps_ind_16_bin ps_ind_17_bin ps_ind_18_bin ps_reg_01
## 0 0 0 0 0
## ps_reg_02 ps_car_01_cat ps_car_02_cat ps_car_04_cat ps_car_06_cat
## 0 107 5 0 0
## ps_car_07_cat ps_car_08_cat ps_car_09_cat ps_car_10_cat ps_car_11
## 11489 0 569 0 5
## ps_car_12 ps_car_13 ps_car_14 ps_car_15 ps_calc_01
## 1 0 42620 0 0
## ps_calc_02 ps_calc_03 ps_calc_04 ps_calc_05 ps_calc_06
## 0 0 0 0 0
## ps_calc_07 ps_calc_08 ps_calc_09 ps_calc_10 ps_calc_11
## 0 0 0 0 0
## ps_calc_12 ps_calc_13 ps_calc_14 ps_calc_15_bin ps_calc_16_bin
## 0 0 0 0 0
## ps_calc_17_bin ps_calc_18_bin ps_calc_19_bin ps_calc_20_bin
## 0 0 0 0
# Most frequent value; note this masks base::mode(). Ties are broken by
# taking the first mode, so a numeric value (not a string) is always returned.
mode <- function (x, na.rm = TRUE) {
xtab <- table(x)
xmode <- names(which.max(xtab))
return(as.numeric(xmode))
}
ps$ps_ind_05_cat[is.na(ps$ps_ind_05_cat)] <- mode(ps$ps_ind_05_cat, na.rm = TRUE)
ps$ps_car_07_cat[is.na(ps$ps_car_07_cat)] <-mode(ps$ps_car_07_cat, na.rm = TRUE)
ps$ps_car_14[is.na(ps$ps_car_14)] <- median(ps$ps_car_14, na.rm = TRUE)
ps$ps_ind_02_cat[is.na(ps$ps_ind_02_cat)] <- mode(ps$ps_ind_02_cat, na.rm = TRUE)
ps$ps_car_01_cat[is.na(ps$ps_car_01_cat)] <- mode(ps$ps_car_01_cat, na.rm = TRUE)
ps$ps_ind_04_cat[is.na(ps$ps_ind_04_cat)] <- mode(ps$ps_ind_04_cat, na.rm = TRUE)
ps$ps_car_02_cat[is.na(ps$ps_car_02_cat)] <- mode(ps$ps_car_02_cat, na.rm = TRUE)
ps$ps_car_11[is.na(ps$ps_car_11)] <- median(ps$ps_car_11, na.rm = TRUE)
ps$ps_car_12[is.na(ps$ps_car_12)] <- median(ps$ps_car_12, na.rm = TRUE)
ps$ps_car_09_cat[is.na(ps$ps_car_09_cat)] <- mode(ps$ps_car_09_cat, na.rm = TRUE)
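The column-by-column assignments above can also be written as a single loop. This is a sketch (not part of the original pipeline) that applies the same convention: mode for `*_cat` columns, median for the rest.

```r
# Impute every column with missing values: mode for categorical ("cat")
# columns, median for numeric ones.
impute_all <- function(df) {
  for (nm in names(df)) {
    if (!anyNA(df[[nm]])) next
    if (grepl("cat$", nm)) {
      tab <- table(df[[nm]])
      df[[nm]][is.na(df[[nm]])] <- as.numeric(names(which.max(tab)))
    } else {
      df[[nm]][is.na(df[[nm]])] <- median(df[[nm]], na.rm = TRUE)
    }
  }
  df
}

# toy example
d <- data.frame(a_cat = c(1, 1, 2, NA), b = c(1, NA, 3, 5))
r <- impute_all(d)  # a_cat NA -> 1 (mode), b NA -> 3 (median)
```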
table(is.na(ps))##
## FALSE
## 32141448
sapply(ps, function(x) sum(is.na(x)))## target ps_ind_01 ps_ind_02_cat ps_ind_03 ps_ind_04_cat
## 0 0 0 0 0
## ps_ind_05_cat ps_ind_06_bin ps_ind_07_bin ps_ind_08_bin ps_ind_09_bin
## 0 0 0 0 0
## ps_ind_10_bin ps_ind_11_bin ps_ind_12_bin ps_ind_13_bin ps_ind_14
## 0 0 0 0 0
## ps_ind_15 ps_ind_16_bin ps_ind_17_bin ps_ind_18_bin ps_reg_01
## 0 0 0 0 0
## ps_reg_02 ps_car_01_cat ps_car_02_cat ps_car_04_cat ps_car_06_cat
## 0 0 0 0 0
## ps_car_07_cat ps_car_08_cat ps_car_09_cat ps_car_10_cat ps_car_11
## 0 0 0 0 0
## ps_car_12 ps_car_13 ps_car_14 ps_car_15 ps_calc_01
## 0 0 0 0 0
## ps_calc_02 ps_calc_03 ps_calc_04 ps_calc_05 ps_calc_06
## 0 0 0 0 0
## ps_calc_07 ps_calc_08 ps_calc_09 ps_calc_10 ps_calc_11
## 0 0 0 0 0
## ps_calc_12 ps_calc_13 ps_calc_14 ps_calc_15_bin ps_calc_16_bin
## 0 0 0 0 0
## ps_calc_17_bin ps_calc_18_bin ps_calc_19_bin ps_calc_20_bin
## 0 0 0 0
The target column which we want to predict is imbalanced, so we use ovun.sample() to obtain a balanced data set. It is necessary to balance the data before applying a machine learning algorithm; otherwise, the algorithm becomes biased toward the majority class and fails to learn the minority class.
We also compared the performance of models fit on the original data and on the balanced sample; the latter performed better.
Check the original distribution of the target. Only 3.7% of observations have target = 1, which is highly imbalanced.
ggplot(data = PS, aes(x = as.factor(target))) +
geom_bar(fill = "lightblue") +
labs(title = 'Distribution of Target Class (1 = claim filed)',x='target') +
theme(plot.title = element_text(hjust = 0.5))
table(ps$target)/nrow(ps)##
## 0 1
## 0.96355248 0.03644752
The running times of the different models vary widely, so we use the largest sample size that is feasible given the computing time available on our own machines.
We resample the data so that roughly 70% of observations have target = 0 and 30% have target = 1.
pddata0 <- ovun.sample(target~.,data=ps,method = "both", p = .3, N = 2000, seed = 1)$data
pddata <- as.data.frame(pddata0)
table(pddata0$target)/nrow(pddata0)##
## 0 1
## 0.6975 0.3025
pddata0 <- pddata0 %>%
mutate_at(vars(ends_with("cat")), funs(factor)) %>%
mutate_at(vars(ends_with("cat")), funs(as.numeric))
str(pddata0)## 'data.frame': 2000 obs. of 54 variables:
## $ target : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ps_ind_01 : int 5 0 0 0 5 4 1 1 5 2 ...
## $ ps_ind_02_cat : num 4 4 1 1 1 1 1 1 2 1 ...
## $ ps_ind_03 : int 10 7 2 10 5 8 4 7 3 6 ...
## $ ps_ind_04_cat : num 2 1 2 2 1 1 1 1 2 1 ...
## $ ps_ind_05_cat : num 1 1 1 1 1 1 1 1 1 1 ...
## $ ps_ind_06_bin : int 0 0 1 0 0 1 1 1 0 0 ...
## $ ps_ind_07_bin : int 0 0 0 1 1 0 0 0 1 1 ...
## $ ps_ind_08_bin : int 1 0 0 0 0 0 0 0 0 0 ...
## $ ps_ind_09_bin : int 0 1 0 0 0 0 0 0 0 0 ...
## $ ps_ind_10_bin : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ps_ind_11_bin : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ps_ind_12_bin : int 0 0 0 0 0 0 0 0 0 1 ...
## $ ps_ind_13_bin : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ps_ind_14 : int 0 0 0 0 0 0 0 0 0 1 ...
## $ ps_ind_15 : int 8 9 3 2 9 10 12 6 7 0 ...
## $ ps_ind_16_bin : int 1 0 0 0 1 1 1 1 1 0 ...
## $ ps_ind_17_bin : int 0 1 0 0 0 0 0 0 0 0 ...
## $ ps_ind_18_bin : int 0 0 1 1 0 0 0 0 0 1 ...
## $ ps_reg_01 : num 0.9 0.9 0.4 0.6 0.9 0.9 0.3 0.9 0.9 0.7 ...
## $ ps_reg_02 : num 1.4 1.2 0 0.7 0.3 0.4 0 0.4 1.6 1.2 ...
## $ ps_car_01_cat : num 4 4 9 4 6 9 3 1 9 4 ...
## $ ps_car_02_cat : num 1 1 2 1 1 2 2 2 1 1 ...
## $ ps_car_04_cat : num 1 1 1 1 1 1 1 1 1 9 ...
## $ ps_car_06_cat : num 16 15 12 15 2 12 2 12 2 15 ...
## $ ps_car_07_cat : num 2 2 2 2 2 2 2 2 1 2 ...
## $ ps_car_08_cat : num 2 2 2 2 2 2 2 2 2 1 ...
## $ ps_car_09_cat : num 3 3 1 3 3 1 1 1 1 3 ...
## $ ps_car_10_cat : num 2 2 2 2 2 2 2 2 2 2 ...
## $ ps_car_11 : int 2 3 2 3 2 3 2 2 3 3 ...
## $ ps_car_12 : num 0.424 0.447 0.374 0.447 0.4 ...
## $ ps_car_13 : num 0.754 0.828 0.627 0.895 0.923 ...
## $ ps_car_14 : num 0.342 0.361 0.367 0.336 0.408 ...
## $ ps_car_15 : num 2 2.83 2.65 3 3.61 ...
## $ ps_calc_01 : num 0.1 0 0.1 0.7 0.6 0 0.3 0.3 0 0.5 ...
## $ ps_calc_02 : num 0.3 0.5 0.9 0.1 0.1 0 0 0.5 0.3 0 ...
## $ ps_calc_03 : num 0.5 0.8 0.6 0.1 0.6 0.8 0.7 0.6 0.9 0.4 ...
## $ ps_calc_04 : int 2 0 5 1 3 4 3 2 3 1 ...
## $ ps_calc_05 : int 3 1 0 2 2 1 0 2 1 2 ...
## $ ps_calc_06 : int 8 8 9 8 6 7 8 10 8 6 ...
## $ ps_calc_07 : int 1 3 5 4 4 1 4 1 3 5 ...
## $ ps_calc_08 : int 9 8 10 8 9 9 11 11 9 8 ...
## $ ps_calc_09 : int 4 3 2 1 2 3 3 2 2 2 ...
## $ ps_calc_10 : int 8 7 10 9 9 4 10 9 8 8 ...
## $ ps_calc_11 : int 2 6 5 7 3 4 12 7 3 8 ...
## $ ps_calc_12 : int 3 3 0 1 2 2 1 2 1 0 ...
## $ ps_calc_13 : int 6 3 5 4 3 2 0 1 1 5 ...
## $ ps_calc_14 : int 6 11 5 5 9 13 9 7 10 11 ...
## $ ps_calc_15_bin: int 0 1 0 0 0 0 0 0 0 0 ...
## $ ps_calc_16_bin: int 0 1 1 1 1 1 1 1 0 0 ...
## $ ps_calc_17_bin: int 1 0 0 1 1 0 0 1 1 1 ...
## $ ps_calc_18_bin: int 0 0 1 0 1 0 0 0 0 0 ...
## $ ps_calc_19_bin: int 0 0 0 0 1 0 0 0 1 0 ...
## $ ps_calc_20_bin: int 0 0 0 0 0 0 0 0 1 1 ...
As shown above, pddata0 stores the categorical variables as numeric, so we need to adjust the data types and prepare two versions of the data set for the methods we will apply later:
pddata - all binary and categorical variables set as factors.
pddata0 - all variables numeric or integer.
We also split the data into train and test sets, with 70% and 30% of the observations respectively.
factorvariable<-function(data){
data <- data %>%
mutate_at(vars(ends_with("cat")), funs(factor)) %>%
mutate_at(vars(ends_with("bin")), funs(factor))
data <- data %>%
mutate_at((split(names(data),sapply(data, function(x) paste(class(x), collapse=" ")))$integer), funs(as.ordered))
return(data)
}
pddata<-factorvariable(pddata0)
str(pddata)## 'data.frame': 2000 obs. of 54 variables:
## $ target : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ ps_ind_01 : Ord.factor w/ 8 levels "0"<"1"<"2"<"3"<..: 6 1 1 1 6 5 2 2 6 3 ...
## $ ps_ind_02_cat : Factor w/ 4 levels "1","2","3","4": 4 4 1 1 1 1 1 1 2 1 ...
## $ ps_ind_03 : Ord.factor w/ 12 levels "0"<"1"<"2"<"3"<..: 11 8 3 11 6 9 5 8 4 7 ...
## $ ps_ind_04_cat : Factor w/ 2 levels "1","2": 2 1 2 2 1 1 1 1 2 1 ...
## $ ps_ind_05_cat : Factor w/ 7 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ps_ind_06_bin : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 2 2 1 1 ...
## $ ps_ind_07_bin : Factor w/ 2 levels "0","1": 1 1 1 2 2 1 1 1 2 2 ...
## $ ps_ind_08_bin : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 1 ...
## $ ps_ind_09_bin : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
## $ ps_ind_10_bin : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ ps_ind_11_bin : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ ps_ind_12_bin : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
## $ ps_ind_13_bin : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ ps_ind_14 : Ord.factor w/ 3 levels "0"<"1"<"2": 1 1 1 1 1 1 1 1 1 2 ...
## $ ps_ind_15 : Ord.factor w/ 14 levels "0"<"1"<"2"<"3"<..: 9 10 4 3 10 11 13 7 8 1 ...
## $ ps_ind_16_bin : Factor w/ 2 levels "0","1": 2 1 1 1 2 2 2 2 2 1 ...
## $ ps_ind_17_bin : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
## $ ps_ind_18_bin : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 1 1 2 ...
## $ ps_reg_01 : num 0.9 0.9 0.4 0.6 0.9 0.9 0.3 0.9 0.9 0.7 ...
## $ ps_reg_02 : num 1.4 1.2 0 0.7 0.3 0.4 0 0.4 1.6 1.2 ...
## $ ps_car_01_cat : Factor w/ 12 levels "1","2","3","4",..: 4 4 9 4 6 9 3 1 9 4 ...
## $ ps_car_02_cat : Factor w/ 2 levels "1","2": 1 1 2 1 1 2 2 2 1 1 ...
## $ ps_car_04_cat : Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 9 ...
## $ ps_car_06_cat : Factor w/ 18 levels "1","2","3","4",..: 16 15 12 15 2 12 2 12 2 15 ...
## $ ps_car_07_cat : Factor w/ 2 levels "1","2": 2 2 2 2 2 2 2 2 1 2 ...
## $ ps_car_08_cat : Factor w/ 2 levels "1","2": 2 2 2 2 2 2 2 2 2 1 ...
## $ ps_car_09_cat : Factor w/ 5 levels "1","2","3","4",..: 3 3 1 3 3 1 1 1 1 3 ...
## $ ps_car_10_cat : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 2 2 2 2 ...
## $ ps_car_11 : Ord.factor w/ 4 levels "0"<"1"<"2"<"3": 3 4 3 4 3 4 3 3 4 4 ...
## $ ps_car_12 : num 0.424 0.447 0.374 0.447 0.4 ...
## $ ps_car_13 : num 0.754 0.828 0.627 0.895 0.923 ...
## $ ps_car_14 : num 0.342 0.361 0.367 0.336 0.408 ...
## $ ps_car_15 : num 2 2.83 2.65 3 3.61 ...
## $ ps_calc_01 : num 0.1 0 0.1 0.7 0.6 0 0.3 0.3 0 0.5 ...
## $ ps_calc_02 : num 0.3 0.5 0.9 0.1 0.1 0 0 0.5 0.3 0 ...
## $ ps_calc_03 : num 0.5 0.8 0.6 0.1 0.6 0.8 0.7 0.6 0.9 0.4 ...
## $ ps_calc_04 : Ord.factor w/ 6 levels "0"<"1"<"2"<"3"<..: 3 1 6 2 4 5 4 3 4 2 ...
## $ ps_calc_05 : Ord.factor w/ 7 levels "0"<"1"<"2"<"3"<..: 4 2 1 3 3 2 1 3 2 3 ...
## $ ps_calc_06 : Ord.factor w/ 8 levels "3"<"4"<"5"<"6"<..: 6 6 7 6 4 5 6 8 6 4 ...
## $ ps_calc_07 : Ord.factor w/ 9 levels "0"<"1"<"2"<"3"<..: 2 4 6 5 5 2 5 2 4 6 ...
## $ ps_calc_08 : Ord.factor w/ 9 levels "4"<"5"<"6"<"7"<..: 6 5 7 5 6 6 8 8 6 5 ...
## $ ps_calc_09 : Ord.factor w/ 8 levels "0"<"1"<"2"<"3"<..: 5 4 3 2 3 4 4 3 3 3 ...
## $ ps_calc_10 : Ord.factor w/ 20 levels "0"<"1"<"2"<"3"<..: 9 8 11 10 10 5 11 10 9 9 ...
## $ ps_calc_11 : Ord.factor w/ 16 levels "0"<"1"<"2"<"3"<..: 3 7 6 8 4 5 13 8 4 9 ...
## $ ps_calc_12 : Ord.factor w/ 8 levels "0"<"1"<"2"<"3"<..: 4 4 1 2 3 3 2 3 2 1 ...
## $ ps_calc_13 : Ord.factor w/ 11 levels "0"<"1"<"2"<"3"<..: 7 4 6 5 4 3 1 2 2 6 ...
## $ ps_calc_14 : Ord.factor w/ 19 levels "1"<"2"<"3"<"4"<..: 6 11 5 5 9 13 9 7 10 11 ...
## $ ps_calc_15_bin: Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
## $ ps_calc_16_bin: Factor w/ 2 levels "0","1": 1 2 2 2 2 2 2 2 1 1 ...
## $ ps_calc_17_bin: Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 2 2 2 ...
## $ ps_calc_18_bin: Factor w/ 2 levels "0","1": 1 1 2 1 2 1 1 1 1 1 ...
## $ ps_calc_19_bin: Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 2 1 ...
## $ ps_calc_20_bin: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 2 ...
set.seed(123)
sample_index <- sample(nrow(pddata), replace = FALSE, size = 0.7*nrow(pddata))
train <- pddata[sample_index,]
test<-pddata[-sample_index,]
train0 <- pddata0[sample_index,]
test0<-pddata0[-sample_index,]
ggplot(pddata0, aes(target)) + geom_bar(fill = "lightblue") + ggtitle("Distribution of Target") + theme(plot.title = element_text(hjust = 0.5))
#ps_ind(continuous)
p1 <- ggplot(pddata0, aes(ps_ind_01)) + geom_histogram(binwidth = 1, fill = "lightblue")
p2 <- ggplot(pddata0, aes(ps_ind_03)) + geom_histogram(binwidth = 1, fill = "lightblue")
p3 <- ggplot(pddata0, aes(ps_ind_14)) + geom_histogram(binwidth = 1, fill = "lightblue")
p4 <- ggplot(pddata0, aes(ps_ind_15)) + geom_histogram(binwidth = 1, fill = "lightblue")
#ps_ind(category)
p5 <- ggplot(pddata0, aes(ps_ind_02_cat)) + geom_bar(fill = "lightblue")
p6 <- ggplot(pddata0, aes(ps_ind_04_cat)) + geom_bar(fill = "lightblue")
p7 <- ggplot(pddata0, aes(ps_ind_05_cat)) + geom_bar(fill = "lightblue")
#ps_ind(binary)
p8 <- ggplot(pddata0, aes(ps_ind_06_bin)) + geom_bar(fill = "lightblue")
p9 <- ggplot(pddata0, aes(ps_ind_07_bin)) + geom_bar(fill = "lightblue")
p10 <- ggplot(pddata0, aes(ps_ind_08_bin)) + geom_bar(fill = "lightblue")
p11 <- ggplot(pddata0, aes(ps_ind_09_bin)) + geom_bar(fill = "lightblue")
p12 <- ggplot(pddata0, aes(ps_ind_10_bin)) + geom_bar(fill = "lightblue")
p13 <- ggplot(pddata0, aes(ps_ind_11_bin)) + geom_bar(fill = "lightblue")
p14 <- ggplot(pddata0, aes(ps_ind_12_bin)) + geom_bar(fill = "lightblue")
p15 <- ggplot(pddata0, aes(ps_ind_13_bin)) + geom_bar(fill = "lightblue")
p16 <- ggplot(pddata0, aes(ps_ind_16_bin)) + geom_bar(fill = "lightblue")
p17 <- ggplot(pddata0, aes(ps_ind_17_bin)) + geom_bar(fill = "lightblue")
p18 <- ggplot(pddata0, aes(ps_ind_18_bin)) + geom_bar(fill = "lightblue")
grid.arrange(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, p13, p14, p15, p16, p17, p18, nrow = 5)
#ps_reg
p1 <- ggplot(pddata0, aes(ps_reg_01)) + geom_histogram(binwidth = 0.5, fill = "lightblue")
p2 <- ggplot(pddata0, aes(ps_reg_02)) + geom_histogram(binwidth = 0.5, fill = "lightblue")
grid.arrange(p1, p2, nrow = 1)
#ps_car(continuous)
p1 <- ggplot(pddata0, aes(ps_car_11)) + geom_histogram(binwidth = 0.5, fill = "lightblue")
p2 <- ggplot(pddata0, aes(ps_car_12)) + geom_histogram(binwidth = 0.5, fill = "lightblue")
p3 <- ggplot(pddata0, aes(ps_car_13)) + geom_histogram(binwidth = 0.5, fill = "lightblue")
p4 <- ggplot(pddata0, aes(ps_car_14)) + geom_histogram(binwidth = 0.5, fill = "lightblue")
p5 <- ggplot(pddata0, aes(ps_car_15)) + geom_histogram(binwidth = 0.5, fill = "lightblue")
#ps_car(category)
p6 <- ggplot(pddata0, aes(ps_car_01_cat)) + geom_bar(fill = "lightblue")
p7 <- ggplot(pddata0, aes(ps_car_02_cat)) + geom_bar(fill = "lightblue")
p9 <- ggplot(pddata0, aes(ps_car_04_cat)) + geom_bar(fill = "lightblue")
p11 <- ggplot(pddata0, aes(ps_car_06_cat)) + geom_bar(fill = "lightblue")
p12 <- ggplot(pddata0, aes(ps_car_07_cat)) + geom_bar(fill = "lightblue")
p13 <- ggplot(pddata0, aes(ps_car_08_cat)) + geom_bar(fill = "lightblue")
p14 <- ggplot(pddata0, aes(ps_car_09_cat)) + geom_bar(fill = "lightblue")
p15 <- ggplot(pddata0, aes(ps_car_10_cat)) + geom_bar(fill = "lightblue")
grid.arrange(p1, p2, p3, p4, p5, p6, p7, p9, p11, p12, p13, p14, p15, nrow = 4)
#ps_calc(continuous)
p1 <- ggplot(pddata0, aes(ps_calc_01)) + geom_histogram(binwidth = 0.5, fill = "lightblue")
p2 <- ggplot(pddata0, aes(ps_calc_02)) + geom_histogram(binwidth = 0.5, fill = "lightblue")
p3 <- ggplot(pddata0, aes(ps_calc_03)) + geom_histogram(binwidth = 0.5, fill = "lightblue")
p4 <- ggplot(pddata0, aes(ps_calc_04)) + geom_histogram(binwidth = 0.5, fill = "lightblue")
p5 <- ggplot(pddata0, aes(ps_calc_05)) + geom_histogram(binwidth = 0.5, fill = "lightblue")
p6 <- ggplot(pddata0, aes(ps_calc_06)) + geom_histogram(binwidth = 0.5, fill = "lightblue")
p7 <- ggplot(pddata0, aes(ps_calc_07)) + geom_histogram(binwidth = 0.5, fill = "lightblue")
p8 <- ggplot(pddata0, aes(ps_calc_08)) + geom_histogram(binwidth = 0.5, fill = "lightblue")
p9 <- ggplot(pddata0, aes(ps_calc_09)) + geom_histogram(binwidth = 0.5, fill = "lightblue")
p10 <- ggplot(pddata0, aes(ps_calc_10)) + geom_histogram(binwidth = 0.5, fill = "lightblue")
p11 <- ggplot(pddata0, aes(ps_calc_11)) + geom_histogram(binwidth = 0.5, fill = "lightblue")
p12 <- ggplot(pddata0, aes(ps_calc_12)) + geom_histogram(binwidth = 0.5, fill = "lightblue")
p13 <- ggplot(pddata0, aes(ps_calc_13)) + geom_histogram(binwidth = 0.5, fill = "lightblue")
p14 <- ggplot(pddata0, aes(ps_calc_14)) + geom_histogram(binwidth = 0.5, fill = "lightblue")
#ps_calc(binary)
p15 <- ggplot(pddata0, aes(ps_calc_15_bin)) + geom_bar(fill = "lightblue")
p16 <- ggplot(pddata0, aes(ps_calc_16_bin)) + geom_bar(fill = "lightblue")
p17 <- ggplot(pddata0, aes(ps_calc_17_bin)) + geom_bar(fill = "lightblue")
p18 <- ggplot(pddata0, aes(ps_calc_18_bin)) + geom_bar(fill = "lightblue")
p19 <- ggplot(pddata0, aes(ps_calc_19_bin)) + geom_bar(fill = "lightblue")
p20 <- ggplot(pddata0, aes(ps_calc_20_bin)) + geom_bar(fill = "lightblue")
grid.arrange(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, p13, p14, p15, p16, p17, p18, p19, p20, nrow = 5)
Check the correlations between the variables; corrplot is a good way to visualize them.
M <- cor(pddata0[1:36]) # the remaining variables have little correlation with the others
corrplot(M, method = "circle", title = "Correlation", tl.cex = 0.5, tl.col = 'black', mar=c(1, 1, 1, 1))
cor(ps$ps_ind_14, ps$ps_ind_12_bin)## [1] 0.8901273
cor(ps$ps_ind_14, ps$ps_ind_11_bin)## [1] 0.564903
cor(ps$ps_car_13, ps$ps_car_12)## [1] 0.672014
cor(ps$ps_ind_18_bin, ps$ps_ind_16_bin)## [1] -0.5942654
Given the strong correlation with ps_ind_12_bin (r = 0.89), we delete ps_ind_14 from the logistic model.
base.model <- glm(target ~ 1, data = train_train, family = binomial(link = 'logit'))
all.model <- glm(target ~ . , data = train_train, family = binomial(link = 'logit'))
ols_step <- step(base.model, scope = list(lower = base.model, upper = all.model), direction = 'both', k=2, trace = F)
ols_step##
## Call: glm(formula = target ~ ps_car_13 + ps_ind_05_cat + ps_ind_15 +
## ps_ind_17_bin + ps_ind_06_bin + ps_car_01_cat + ps_car_07_cat +
## ps_ind_09_bin + ps_ind_03 + ps_calc_02 + ps_calc_10, family = binomial(link = "logit"),
## data = train_train)
##
## Coefficients:
## (Intercept) ps_car_13 ps_ind_05_cat ps_ind_15 ps_ind_17_bin
## 0.008193 0.881509 0.134945 -0.053724 0.485027
## ps_ind_06_bin ps_car_01_cat ps_car_07_cat ps_ind_09_bin ps_ind_03
## -0.393706 -0.040550 -0.447874 -0.207529 0.024432
## ps_calc_02 ps_calc_10
## -0.214374 -0.019019
##
## Degrees of Freedom: 3499 Total (i.e. Null); 3488 Residual
## Null Deviance: 4318
## Residual Deviance: 4124 AIC: 4148
summary(ols_step)##
## Call:
## glm(formula = target ~ ps_car_13 + ps_ind_05_cat + ps_ind_15 +
## ps_ind_17_bin + ps_ind_06_bin + ps_car_01_cat + ps_car_07_cat +
## ps_ind_09_bin + ps_ind_03 + ps_calc_02 + ps_calc_10, family = binomial(link = "logit"),
## data = train_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7316 -0.8625 -0.7077 1.2540 2.0253
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.008193 0.402808 0.020 0.98377
## ps_car_13 0.881509 0.169698 5.195 2.05e-07 ***
## ps_ind_05_cat 0.134945 0.024764 5.449 5.06e-08 ***
## ps_ind_15 -0.053724 0.010653 -5.043 4.58e-07 ***
## ps_ind_17_bin 0.485027 0.105043 4.617 3.89e-06 ***
## ps_ind_06_bin -0.393706 0.090733 -4.339 1.43e-05 ***
## ps_car_01_cat -0.040550 0.012430 -3.262 0.00111 **
## ps_car_07_cat -0.447874 0.161448 -2.774 0.00554 **
## ps_ind_09_bin -0.207529 0.109204 -1.900 0.05738 .
## ps_ind_03 0.024432 0.014255 1.714 0.08654 .
## ps_calc_02 -0.214374 0.132249 -1.621 0.10502
## ps_calc_10 -0.019019 0.012778 -1.488 0.13663
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4317.6 on 3499 degrees of freedom
## Residual deviance: 4123.9 on 3488 degrees of freedom
## AIC: 4147.9
##
## Number of Fisher Scoring iterations: 4
log_preds <- predict(ols_step, newdata = train_test[,-1], type = "response")
log_preds <- data.frame(log_preds = log_preds)
log_preds <- round(log_preds)
table(log_preds)## log_preds
## 0 1
## 1401 99
log_pred_compare <- cbind(log_preds, train_test$target)
The Kappa statistic varies from 0 to 1; as a rule of thumb, values below 0.2 indicate slight agreement, 0.2-0.4 fair, 0.4-0.6 moderate, 0.6-0.8 substantial, and above 0.8 almost perfect agreement.
confusionMatrix(as.factor(log_pred_compare$log_preds), as.factor(log_pred_compare$`train_test$target`))## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1001 400
## 1 47 52
##
## Accuracy : 0.702
## 95% CI : (0.6781, 0.7251)
## No Information Rate : 0.6987
## P-Value [Acc > NIR] : 0.4014
##
## Kappa : 0.0902
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9552
## Specificity : 0.1150
## Pos Pred Value : 0.7145
## Neg Pred Value : 0.5253
## Prevalence : 0.6987
## Detection Rate : 0.6673
## Detection Prevalence : 0.9340
## Balanced Accuracy : 0.5351
##
## 'Positive' Class : 0
##
logi_cor<-cor(log_pred_compare$log_preds, log_pred_compare$`train_test$target`);logi_cor## [1] 0.1297273
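As a sanity check on the reported Kappa of 0.0902, it can be computed by hand from the confusion matrix above, comparing the observed agreement against the agreement expected by chance. This is a small sketch, not part of the original pipeline.

```r
# Confusion matrix from above: rows = prediction, cols = reference
cm <- matrix(c(1001, 47, 400, 52), nrow = 2,
             dimnames = list(pred = c("0", "1"), ref = c("0", "1")))
n  <- sum(cm)
po <- sum(diag(cm)) / n                     # observed agreement (= accuracy)
pe <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
kappa <- (po - pe) / (1 - pe)
round(kappa, 4)  # 0.0902, matching the confusionMatrix() output
```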
A typical rule-of-thumb interpretation of the area under the ROC curve (AUC): 0.5 means no discrimination, 0.7-0.8 is acceptable, 0.8-0.9 is excellent, and above 0.9 is outstanding.
pred<-ROCR::prediction(log_pred_compare$log_preds, log_pred_compare$`train_test$target`)
roc_logi<-performance(pred,measure="tpr", x.measure="fpr")
plot(roc_logi, main="ROC curve for Logistic Regression Model", col="blue", lwd=3)
segments(0, 0, 1, 1, lty=2)
roc_logi_auc<-performance(pred, measure="auc")
roc_logi_auc@y.values## [[1]]
## [1] 0.5350985
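Note that because the predictions here are hard 0/1 labels rather than probabilities, the ROC "curve" has a single operating point and the AUC reduces to the balanced accuracy, (sensitivity + specificity) / 2. A quick sketch, with the counts taken from the confusion matrix above:

```r
# For a hard classifier, AUC = (TPR + TNR) / 2 = balanced accuracy
sens <- 1001 / (1001 + 47)  # sensitivity for the positive class "0"
spec <- 52 / (400 + 52)     # specificity
auc  <- (sens + spec) / 2
round(auc, 4)  # 0.5351, matching performance(pred, measure = "auc")
```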
As the logistic regression results show, the stepwise algorithm selects 11 variables for the model. However, the evaluation shows that this model is weak: the accuracy, Kappa, and area under the ROC curve all indicate poor performance.
Logistic regression may not be a good model for this case, but the variables it chose could be useful in other algorithms.
prin_comp <- prcomp(train0[,-1], scale. = T)
fviz_pca_biplot(prin_comp, axes = c(1, 2), geom = "point",
col.ind = "black", col.var = "steelblue", label = "all",
invisible = "none", repel = F, habillage = train$target,
palette = NULL, addEllipses = TRUE, title = "PCA - Biplot")
#compute standard deviation of each principal component
std_dev <- prin_comp$sdev
#compute variance
pr_var <- std_dev^2
#check variance of first 10 components
pr_var[1:10]## [1] 3.796254 2.678176 2.054988 1.882825 1.827166 1.604345 1.482933
## [8] 1.393821 1.304530 1.287704
#proportion of variance explained
prop_varex <- pr_var/sum(pr_var)
prop_varex[1:20]## [1] 0.07162743 0.05053162 0.03877335 0.03552500 0.03447484 0.03027065
## [7] 0.02797987 0.02629852 0.02461378 0.02429631 0.02356565 0.02290595
## [13] 0.02200347 0.02139322 0.02090439 0.02086223 0.02061759 0.02040525
## [19] 0.01996466 0.01950848
plot(prop_varex, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
type = "b")
plot(cumsum(prop_varex), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",type="b")
The plot above shows that roughly 40-50 components explain over 90% of the variance in the data set. In other words, using PCA we could reduce the 54 predictors to 40-50 components without sacrificing much explained variance.
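To make the "40-50 components" reading concrete, the smallest number of components reaching a cumulative-variance threshold can be computed directly. This is a sketch with a toy eigenvalue vector; in the analysis above, `prin_comp$sdev^2` would play the role of `pr_var`.

```r
# Smallest number of components whose cumulative share of variance
# reaches a threshold (e.g. 90%).
n_components <- function(pr_var, threshold = 0.90) {
  prop <- pr_var / sum(pr_var)
  which(cumsum(prop) >= threshold)[1]
}

# toy example: four components with variances 4, 3, 2, 1
n_components(c(4, 3, 2, 1), threshold = 0.90)  # 3 (first three cover 9/10)
```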
Build the C5.0 model on the raw data as a sanity check.
cmodel<-C5.0(train0[,-1],as.factor(train$target))
predrp <- predict(cmodel, test0[, -1], type="class")
target1<-as.factor(test0$target)
confusionMatrix(predrp, target1)## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 315 114
## 1 103 68
##
## Accuracy : 0.6383
## 95% CI : (0.5984, 0.6768)
## No Information Rate : 0.6967
## P-Value [Acc > NIR] : 0.9991
##
## Kappa : 0.1294
## Mcnemar's Test P-Value : 0.4972
##
## Sensitivity : 0.7536
## Specificity : 0.3736
## Pos Pred Value : 0.7343
## Neg Pred Value : 0.3977
## Prevalence : 0.6967
## Detection Rate : 0.5250
## Detection Prevalence : 0.7150
## Balanced Accuracy : 0.5636
##
## 'Positive' Class : 0
##
predrp<-ROCR::prediction(predictions=as.numeric(predrp), labels=test$target)
rocrp<-performance(predrp,measure="tpr", x.measure="fpr")
plot(rocrp, main="ROC curve for C5.0 Model", col="blue", lwd=3)
segments(0, 0, 1, 1, lty=2)
rocrp_auc<-performance(predrp, measure="auc")
rocrp_auc@y.values## [[1]]
## [1] 0.5636074
Add a training set with the principal components.
train.data <- data.frame(target = train0$target, prin_comp$x)
train.data <- train.data[,1:40]
test.zspace <- predict(prin_comp, newdata=test0[, -1])
pca.train.df <- as.data.frame(prin_comp$x)
Run a decision tree; transform the test set into the PCA space.
pca.model <- C5.0(pca.train.df,as.factor(train0$target))
pca.test.df <- as.data.frame(test.zspace)
pca.test.df$target <- test0[, 1]
pca.pred <- predict(pca.model, pca.test.df[,-54], type="class")
confusionMatrix(pca.pred, target1)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 374 142
## 1 44 40
##
## Accuracy : 0.69
## 95% CI : (0.6513, 0.7268)
## No Information Rate : 0.6967
## P-Value [Acc > NIR] : 0.6571
##
## Kappa : 0.135
## Mcnemar's Test P-Value : 1.141e-12
##
## Sensitivity : 0.8947
## Specificity : 0.2198
## Pos Pred Value : 0.7248
## Neg Pred Value : 0.4762
## Prevalence : 0.6967
## Detection Rate : 0.6233
## Detection Prevalence : 0.8600
## Balanced Accuracy : 0.5573
##
## 'Positive' Class : 0
##
predpca<-ROCR::prediction(predictions=as.numeric(pca.pred), labels=test$target)
rocpca<-performance(predpca,measure="tpr", x.measure="fpr")
plot(rocpca, main="ROC curve for PCA+C5.0 Model", col="blue", lwd=3)
segments(0, 0, 1, 1, lty=2)
rocpca_auc<-performance(predpca, measure="auc")
rocpca_auc@y.values
## [[1]]
## [1] 0.5572585
For this data, the scree plot has no clear "elbow": most of the PCs explain roughly the same amount of variance, so it is hard to decide how many components to keep. Moreover, the C5.0 model built on the principal components performs no better than the one built on the raw data, and is in fact slightly worse. Thus, PCA is not well suited to this analysis.
mt<-floor(sqrt(ncol(train)))
rf.fit <- randomForest(target~. , data=train,importance=TRUE,ntree=500,mtry=mt)
varImpPlot(rf.fit, cex=0.5); print(rf.fit)
##
## Call:
## randomForest(formula = target ~ ., data = train, importance = TRUE, ntree = 500, mtry = mt)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 28.64%
## Confusion matrix:
## 0 1 class.error
## 0 935 42 0.04298874
## 1 359 64 0.84869976
plot(rf.fit,main="model")
The choice of mtry: \[mtry=\lfloor\sqrt{ncol(data)}\rfloor=7\]
var.imp <- data.frame(importance(rf.fit,type=2))
var.imp$Variables <- row.names(var.imp)
varord<-var.imp[order(var.imp$MeanDecreaseGini,decreasing = T),]
imprf<-varord$Variables[1:25]
imprf
## [1] "ps_car_06_cat" "ps_car_13" "ps_car_01_cat" "ps_car_14"
## [5] "ps_ind_15" "ps_calc_03" "ps_reg_02" "ps_calc_14"
## [9] "ps_ind_03" "ps_calc_10" "ps_calc_01" "ps_calc_02"
## [13] "ps_calc_11" "ps_car_15" "ps_reg_01" "ps_calc_13"
## [17] "ps_car_12" "ps_calc_07" "ps_ind_01" "ps_calc_08"
## [21] "ps_calc_06" "ps_calc_05" "ps_calc_09" "ps_ind_05_cat"
## [25] "ps_calc_04"
predrf1 <- predict(rf.fit,test,type="prob")
summary(predrf1)
## 0 1
## Min. :0.2300 Min. :0.0520
## 1st Qu.:0.6120 1st Qu.:0.2455
## Median :0.6900 Median :0.3100
## Mean :0.6768 Mean :0.3232
## 3rd Qu.:0.7545 3rd Qu.:0.3880
## Max. :0.9480 Max. :0.7700
predrf2<-predict(rf.fit,test)
confusionMatrix(predrf2,test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 396 161
## 1 22 21
##
## Accuracy : 0.695
## 95% CI : (0.6564, 0.7316)
## No Information Rate : 0.6967
## P-Value [Acc > NIR] : 0.5552
##
## Kappa : 0.08
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9474
## Specificity : 0.1154
## Pos Pred Value : 0.7110
## Neg Pred Value : 0.4884
## Prevalence : 0.6967
## Detection Rate : 0.6600
## Detection Prevalence : 0.9283
## Balanced Accuracy : 0.5314
##
## 'Positive' Class : 0
##
pred<-ROCR::prediction(predictions=predrf1[,2], labels=test$target)
roc<-performance(pred,measure="tpr", x.measure="fpr")
plot(roc, main="ROC curve for Random Forest Model", col="blue", lwd=3)
segments(0, 0, 1, 1, lty=2)
roc_auc<-performance(pred, measure="auc")
roc_auc@y.values
## [[1]]
## [1] 0.6478061
As expected, the Random Forest model performs much better than a single decision tree, since it is an ensemble classifier that aggregates the predictions of many decision trees.
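The benefit of aggregating many trees can be illustrated with a toy majority vote: if each of an odd number of roughly independent classifiers is correct with probability above 0.5, the majority vote is correct far more often. The accuracy and counts below are illustrative, not taken from our models:

```r
# Toy illustration of why ensembles beat a single weak learner
set.seed(42)
p <- 0.6            # accuracy of each individual "tree"
n_voters <- 101     # odd number of independent voters
n_trials <- 10000
correct_votes <- rbinom(n_trials, size = n_voters, prob = p)  # correct votes per trial
majority_acc <- mean(correct_votes > n_voters / 2)            # majority is right
majority_acc        # much higher than the individual accuracy of 0.6
```

Real trees in a forest are correlated, so the gain is smaller in practice, but the direction of the effect is the same.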
fmla <- as.formula(paste("target ~ ", paste(colnames(train0[,-1]), collapse= "+")))
NN_model<-neuralnet(fmla, data=as.matrix(train0), hidden = 5,stepmax = 1e6)
NN_pred<-compute(NN_model, test0[, -1])
NN_pred_results<-ifelse(NN_pred$net.result>0.5,1,0)
confusionMatrix(as.factor(NN_pred_results),as.factor(test0$target))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 376 152
## 1 42 30
##
## Accuracy : 0.6766667
## 95% CI : (0.637605, 0.7139784)
## No Information Rate : 0.6966667
## P-Value [Acc > NIR] : 0.8662546
##
## Kappa : 0.077596
## Mcnemar's Test P-Value : 0.000000000000005046636
##
## Sensitivity : 0.8995215
## Specificity : 0.1648352
## Pos Pred Value : 0.7121212
## Neg Pred Value : 0.4166667
## Prevalence : 0.6966667
## Detection Rate : 0.6266667
## Detection Prevalence : 0.8800000
## Balanced Accuracy : 0.5321783
##
## 'Positive' Class : 0
##
prednn<-ROCR::prediction(predictions=NN_pred_results, labels=target1)
rocnn<-performance(prednn,measure="tpr", x.measure="fpr")
plot(rocnn, main="ROC curve for Neural Network Model", col="blue", lwd=3)
segments(0, 0, 1, 1, lty=2)
rocnn_auc<-performance(prednn, measure="auc")
rocnn_auc@y.values
## [[1]]
## [1] 0.532178348
The neural network was expected to outperform the previous models. However, training it is so time-consuming that tuning the parameters properly was difficult.
XGBoost is short for "Extreme Gradient Boosting". It is used for supervised learning problems, where training data with multiple features xi are used to predict a target variable yi. Its objective consists of two parts: a training loss and a regularization term. The model underlying XGBoost is a tree ensemble: a set of classification and regression trees (CART). In a CART, a real-valued score is associated with each leaf, which gives richer interpretations than pure classification.
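In symbols, following the standard XGBoost formulation, the regularized objective combines the training loss over the n examples with a complexity penalty over the K trees of the ensemble:

```latex
\[
\mathrm{obj}(\theta) \;=\; \sum_{i=1}^{n} l\!\left(y_i, \hat{y}_i\right) \;+\; \sum_{k=1}^{K} \Omega(f_k),
\qquad \hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F},
\]
```

where \(l\) is the training loss (here the logistic loss, via `objective = "binary:logistic"`), \(\Omega\) penalizes tree complexity, and \(\mathcal{F}\) is the space of CARTs.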
labels = train_train['target']
y <- labels$target
bstSparse <- xgboost(data = data.matrix(train_train[,-1]), label = y, max.depth = 30, eta = 0.05, nthread = 3, nround = 30, objective = "binary:logistic")
## [1] train-error:0.034051
## [2] train-error:0.024060
## [3] train-error:0.016566
## [4] train-error:0.013366
## [5] train-error:0.009706
## [6] train-error:0.007871
## [7] train-error:0.006891
## [8] train-error:0.006149
## [9] train-error:0.005503
## [10] train-error:0.005157
## [11] train-error:0.004663
## [12] train-error:0.004180
## [13] train-error:0.003954
## [14] train-error:0.003671
## [15] train-error:0.003391
## [16] train-error:0.003200
## [17] train-error:0.003009
## [18] train-error:0.002783
## [19] train-error:0.002649
## [20] train-error:0.002403
## [21] train-error:0.002214
## [22] train-error:0.002074
## [23] train-error:0.002000
## [24] train-error:0.001923
## [25] train-error:0.001814
## [26] train-error:0.001660
## [27] train-error:0.001571
## [28] train-error:0.001523
## [29] train-error:0.001403
## [30] train-error:0.001351
xgb_pred <- predict(bstSparse, data.matrix(train_test[,-1]))
xgb_pred <- data.frame(xgb_pred)
xgb_pred <- round(xgb_pred)
xgb_pred_compare <- cbind(xgb_pred, train_test$target)
confusionMatrix(as.factor(xgb_pred_compare$xgb_pred), as.factor(xgb_pred_compare$`train_test$target`))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 104446 1207
## 1 729 43618
##
## Accuracy : 0.9870933
## 95% CI : (0.9865093, 0.9876584)
## No Information Rate : 0.7011667
## P-Value [Acc > NIR] : < 0.00000000000000022204
##
## Kappa : 0.9691067
## Mcnemar's Test P-Value : < 0.00000000000000022204
##
## Sensitivity : 0.9930687
## Specificity : 0.9730731
## Pos Pred Value : 0.9885758
## Neg Pred Value : 0.9835615
## Prevalence : 0.7011667
## Detection Rate : 0.6963067
## Detection Prevalence : 0.7043533
## Balanced Accuracy : 0.9830709
##
## 'Positive' Class : 0
##
pred_xgb<-ROCR::prediction(xgb_pred_compare$xgb_pred, xgb_pred_compare$`train_test$target`)
roc_xgb<-performance(pred_xgb, measure="tpr", x.measure="fpr")
plot(roc_xgb, main="ROC curve for XGBoost Model", col="blue", lwd=3)
segments(0, 0, 1, 1, lty=2)
roc_xgb_auc<-performance(pred_xgb, measure="auc")
roc_xgb_auc@y.values
## [[1]]
## [1] 0.9830708785
cor(xgb_pred_compare$xgb_pred, xgb_pred_compare$`train_test$target`)
## [1] 0.9691348761
XGBoost clearly works best for this case. Possible reasons for its strong performance: XGBoost runs much faster than the other methods, so we could train it on the largest data set, which fits the model better than the others. XGBoost is also sparsity-aware, and boosted trees work especially well on categorical features, of which our data set has many. The speed comes from its sparse data structure and careful implementation, which let XGBoost sort columns independently so the sorting work can be divided among parallel CPU threads.
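One way to exploit that sparsity in R is to one-hot encode the categorical features into a sparse matrix before handing them to XGBoost, instead of the dense `data.matrix()` used above. A minimal sketch with a toy data frame (the column names and values are illustrative):

```r
library(Matrix)  # provides sparse.model.matrix()

df <- data.frame(
  ps_car_cat = factor(c("a", "b", "a", "c")),  # toy categorical feature
  ps_reg_num = c(0.1, 0.5, 0.3, 0.9)           # toy numeric feature
)
# One-hot encode the factor; most entries are zero, stored as a sparse dgCMatrix
X_sparse <- sparse.model.matrix(~ . - 1, data = df)
dim(X_sparse)   # 4 rows, 4 columns (3 dummy columns + 1 numeric column)
```

The resulting `dgCMatrix` can be passed to `xgboost()` directly, letting its sparsity-aware split finding skip the zero entries.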
imprf #selected variables from RandomForest
## [1] "ps_car_06_cat" "ps_car_13" "ps_car_01_cat" "ps_car_14"
## [5] "ps_ind_15" "ps_calc_03" "ps_reg_02" "ps_calc_14"
## [9] "ps_ind_03" "ps_calc_10" "ps_calc_01" "ps_calc_02"
## [13] "ps_calc_11" "ps_car_15" "ps_reg_01" "ps_calc_13"
## [17] "ps_car_12" "ps_calc_07" "ps_ind_01" "ps_calc_08"
## [21] "ps_calc_06" "ps_calc_05" "ps_calc_09" "ps_ind_05_cat"
## [25] "ps_calc_04"
ols_step$coefficients # from logistic regression
## (Intercept) ps_car_13 ps_ind_05_cat ps_ind_15
## 0.008192848643 0.881508871955 0.134944934632 -0.053723681378
## ps_ind_17_bin ps_ind_06_bin ps_car_01_cat ps_car_07_cat
## 0.485026816911 -0.393705632853 -0.040549611039 -0.447873929583
## ps_ind_09_bin ps_ind_03 ps_calc_02 ps_calc_10
## -0.207528910331 0.024432051182 -0.214374023265 -0.019019396169
#Using the union of the two methods
sele_variable <- c("target", "ps_car_06_cat", "ps_car_13", "ps_car_01_cat", "ps_car_14", "ps_ind_15", "ps_calc_10", "ps_ind_03", "ps_reg_02", "ps_calc_14", "ps_calc_11", "ps_calc_03", "ps_calc_01", "ps_calc_02", "ps_calc_08", "ps_calc_13", "ps_calc_07", "ps_car_15", "ps_ind_05_cat", "ps_reg_01", "ps_calc_06", "ps_ind_01", "ps_car_12", "ps_calc_05", "ps_calc_09", "ps_ind_17_bin", "ps_car_07_cat")
train_train_se <- train_train[,sele_variable]
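Rather than typing the combined variable list by hand, the two selections can be merged with `union()`, which also drops duplicate names automatically. A sketch with toy vectors standing in for `imprf` and the logistic-regression coefficient names:

```r
# Combine the variables picked by Random Forest and logistic regression
rf_vars  <- c("ps_car_13", "ps_ind_15", "ps_reg_02")          # toy stand-in for imprf
glm_vars <- c("ps_car_13", "ps_ind_17_bin", "ps_car_07_cat")  # toy stand-in for names(ols_step$coefficients)[-1]
sele_vars_toy <- c("target", union(rf_vars, glm_vars))        # union() de-duplicates
sele_vars_toy
```

With the real objects this would be `c("target", union(imprf, names(ols_step$coefficients)[-1]))`, dropping the intercept.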
train_test_se <- train_test[,sele_variable]
labels = train_train_se['target']
y <- labels$target
bstSparse_se <- xgboost(data = data.matrix(train_train_se[,-1]), label = y, max.depth = 30, eta = 0.05, nthread = 3, nround = 30, objective = "binary:logistic")
## [1] train-error:0.037260
## [2] train-error:0.024817
## [3] train-error:0.021260
## [4] train-error:0.015080
## [5] train-error:0.013000
## [6] train-error:0.011526
## [7] train-error:0.010617
## [8] train-error:0.009423
## [9] train-error:0.008366
## [10] train-error:0.007534
## [11] train-error:0.006966
## [12] train-error:0.006314
## [13] train-error:0.005677
## [14] train-error:0.005100
## [15] train-error:0.004640
## [16] train-error:0.004183
## [17] train-error:0.003903
## [18] train-error:0.003529
## [19] train-error:0.003211
## [20] train-error:0.003000
## [21] train-error:0.002800
## [22] train-error:0.002566
## [23] train-error:0.002443
## [24] train-error:0.002243
## [25] train-error:0.002137
## [26] train-error:0.002023
## [27] train-error:0.001929
## [28] train-error:0.001840
## [29] train-error:0.001763
## [30] train-error:0.001674
xgb_se_pred <- predict(bstSparse_se, data.matrix(train_test_se[,-1]))
xgb_se_pred <- data.frame(xgb_se_pred)
xgb_se_pred <- round(xgb_se_pred)
xgb_se_pred_compare <- cbind(xgb_se_pred, train_test_se$target)
confusionMatrix(as.factor(xgb_se_pred_compare$xgb_se_pred), as.factor(xgb_se_pred_compare$`train_test_se$target`))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 104113 1340
## 1 1062 43485
##
## Accuracy : 0.9839867
## 95% CI : (0.9833387, 0.9846159)
## No Information Rate : 0.7011667
## P-Value [Acc > NIR] : < 0.00000000000000022204
##
## Kappa : 0.9617197
## Mcnemar's Test P-Value : 0.00000001586983
##
## Sensitivity : 0.9899025
## Specificity : 0.9701060
## Pos Pred Value : 0.9872929
## Neg Pred Value : 0.9761600
## Prevalence : 0.7011667
## Detection Rate : 0.6940867
## Detection Prevalence : 0.7030200
## Balanced Accuracy : 0.9800043
##
## 'Positive' Class : 0
##
roc_se_xgb_cor <- cor(xgb_se_pred_compare$xgb_se_pred, xgb_se_pred_compare$`train_test_se$target`)
pred_se<-ROCR::prediction(xgb_se_pred_compare$xgb_se_pred, xgb_se_pred_compare$`train_test_se$target`)
roc_se_xgb<-performance(pred_se,measure="tpr", x.measure="fpr")
plot(roc_se_xgb, main="ROC curve for XGBoost Model(Selected Variables)", col="blue", lwd=3)
segments(0, 0, 1, 1, lty=2)
roc_se_xgb_auc<-performance(pred_se, measure="auc")
roc_se_xgb_auc@y.values
## [[1]]
## [1] 0.9800042555
This model does not perform better than the previous one, possibly because the selected variables come from logistic regression and Random Forest, neither of which performed very well. So the union of the logistic-regression variables with the top 25 Random Forest variables may not be informative enough to fully predict the target.
| Method | Accuracy | Kappa | Sensitivity | Specificity |
|---|---|---|---|---|
| Logistic Regression | 0.7023 | 0.0491 | 0.9788 | 0.0578 |
| PCA+C5.0 | 0.6883 | 0.0615 | 0.9347 | 0.1138 |
| Random Forest | 0.7533 | 0.2539 | 0.9838 | 0.2160 |
| Neural Networks | 0.7000 | 0.0064 | 0.9973 | 0.0072 |
| XGBoost | 0.9934 | 0.9842 | 0.9968 | 0.9853 |
In the ROC plot, most curves are elbow-shaped, consisting of two straight segments meeting at a single angle, because both the true labels and the predictions are binary, which yields only one operating point. The Random Forest curve is smoother because its output is a probability rather than a hard 0/1 label.
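The elbow shape can be seen directly: a hard 0/1 prediction defines exactly one (FPR, TPR) operating point, and the ROC "curve" is just the two segments connecting (0, 0), that point, and (1, 1). The toy labels below are illustrative:

```r
# A binary prediction gives a single ROC operating point
truth <- c(0, 0, 1, 1, 0, 1)
pred  <- c(0, 1, 1, 0, 0, 1)   # hard 0/1 predictions, no probabilities
tpr <- sum(pred == 1 & truth == 1) / sum(truth == 1)  # true positive rate
fpr <- sum(pred == 1 & truth == 0) / sum(truth == 0)  # false positive rate
c(fpr = fpr, tpr = tpr)  # ROCR connects (0,0) -> (fpr,tpr) -> (1,1)
```

A probabilistic classifier, by contrast, yields one such point per threshold, which traces out the smoother curve seen for Random Forest.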
| Method | ROC area |
|---|---|
| Logistic Regression | 0.5241 |
| PCA+C5.0 | 0.5242 |
| Random Forest | 0.6869 |
| Neural Networks | 0.5022 |
| XGBoost | 0.9911 |
From the evaluation table above, we can conclude that in this case, the performance of XGBoost is the best, here are some reasons why the other algorithms perform poorly:
For Logistic Regression, performance suffers when relationships are non-linear; the model is not flexible enough to capture complex patterns. And since we have no prior knowledge of the masked variables, we cannot screen them out by hand.
For Neural Networks, the poor performance stems from being able to fit only one hidden layer with a limited number of neurons, so the model's full power was not employed. Deep learning also requires a very large amount of data and much more expertise to tune, making it computationally intensive to train.
For the C5.0 decision tree model, unconstrained individual trees are prone to overfitting. This weakness can be alleviated by using ensembles, which is why Random Forest performs well.
For PCA, the principal components of this data turned out not to be useful, perhaps because the features are far from Gaussian.
As a future extension, we hope the full data set (6 million rows) can be used to train the models, rather than being constrained by the computing power available. For the neural network model, more layers and neurons could also be applied with careful tuning.
I gratefully acknowledge the help of Dr. Ivo Dinov throughout this semester, as well as his expert advice and encouragement on our final project. It has been a pleasure to be a student in this class; I really enjoyed the course, and it was always a delight to discuss professional problems with you.